Instructions

The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.

Introduction

This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of sources by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and hosted on their GitHub repository. We will also use weekly COVID-19 data from the European Centre for Disease Prevention and Control website Note that both data sources are lagging 1-2 days behind our current date!!

Analyse Data in R

To run R and RStudio on Binder, click on this badge - Launch Rstudio Binder.

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, highcharter to plot the data on a world map and a few others to assist in getting the data into shape.
To install these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.

# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
required_packages <- c("dplyr", "tidyr", "ggplot2", "paletteer",
                       "stringr", "readr","readxl", "glue",
                       "highcharter", "forcats", "scales",  
                       "plotly", "lubridate", "here", "ISOweek",
                       "countrycode")
p_load(char=required_packages)

More information on installing and using R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.). Please download the data files from Blackboard and put them in the data folder.

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file on our computer into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which will slightly change the structure of the resulting data frame (will convert all text columns into factors and won’t automatically parse columns containing dates).
> Note that the path to the files can be either relative or absolute to our current working directory and that we use a forward slash / (Unix/Mac style) and not a backslash \ (Windows style), to make life easier, make use of RStudio’s auto-completion feature and the here package._

# read data from the 'data' folder
covid_data <- read_csv(here("data/csse_country_covid_data_19_03_2021.csv")) %>% 
  rename(Country_Region=`Country/Region`)

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column:

#explore the data frame
head(covid_data) # show first 10 rows of the data and typr of variables
## # A tibble: 6 x 5
##   Country_Region Date       Confirmed Deaths Recovered
##   <chr>          <date>         <dbl>  <dbl>     <dbl>
## 1 Afghanistan    2020-01-22         0      0         0
## 2 Afghanistan    2020-01-23         0      0         0
## 3 Afghanistan    2020-01-24         0      0         0
## 4 Afghanistan    2020-01-25         0      0         0
## 5 Afghanistan    2020-01-26         0      0         0
## 6 Afghanistan    2020-01-27         0      0         0
str(covid_data) # show data structure
## tibble [81,216 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country_Region: chr [1:81216] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Date          : Date[1:81216], format: "2020-01-22" "2020-01-23" ...
##  $ Confirmed     : num [1:81216] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths        : num [1:81216] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : num [1:81216] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country/Region` = col_character(),
##   ..   Date = col_date(format = ""),
##   ..   Confirmed = col_double(),
##   ..   Deaths = col_double(),
##   ..   Recovered = col_double()
##   .. )

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.

# summary of variables in my data
summary(covid_data)
##  Country_Region          Date              Confirmed            Deaths        
##  Length:81216       Min.   :2020-01-22   Min.   :       0   Min.   :     0.0  
##  Class :character   1st Qu.:2020-05-06   1st Qu.:     123   1st Qu.:     1.0  
##  Mode  :character   Median :2020-08-20   Median :    4012   Median :    71.0  
##                     Mean   :2020-08-20   Mean   :  195206   Mean   :  4967.4  
##                     3rd Qu.:2020-12-04   3rd Qu.:   54878   3rd Qu.:   932.2  
##                     Max.   :2021-03-19   Max.   :29730000   Max.   :541143.0  
##    Recovered       
##  Min.   :       0  
##  1st Qu.:      18  
##  Median :    1765  
##  Mean   :  113755  
##  3rd Qu.:   26098  
##  Max.   :11107332

We can see that most of our data contains ‘0’ (check the median of Confirmed column). Just to confirm that, let’s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=Confirmed)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(16)

What are the metadata columns that describe our observations?

Country 
Date

The data is evolving over a time-series, to there’s no point treating it as a random population sample.

Time-series plot

Let’s look at confirmed cases data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:

  1. First we group it by Country with group_by()
  2. Then we sort it within each Country by Date (from latest to earliest) with arrange(desc(Date))
  3. We select just the most recent data with slice(1) and remove grouping with ungroup()
  4. Next we arrange it by descending order of confirmed cases and select the top 10
  5. We extract just the Country information as a vector of values using the $ sign
  6. Finally, we subset our original data to contain just the countries from our vector with filter()

Optional step:

  1. We can reorder the countries so they will be ordered in the legend by the number of cases with fct_reorder()

Then we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.

# find the 10 most affected countries (to date)
latest_data <- covid_data %>% group_by(Country_Region) %>% arrange(desc(Date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data %>% arrange(desc(Confirmed)) %>% slice(1:10) %>% .$Country_Region
# have a look at the data as a table
latest_data %>% arrange(desc(Confirmed)) %>% filter(Country_Region %in% most_affected_countries) 
## # A tibble: 10 x 5
##    Country_Region Date       Confirmed Deaths Recovered
##    <chr>          <date>         <dbl>  <dbl>     <dbl>
##  1 US             2021-03-19  29730000 541143         0
##  2 Brazil         2021-03-19  11871390 290314  10435864
##  3 India          2021-03-19  11555284 159558  11107332
##  4 Russia         2021-03-19   4388268  92704   4003682
##  5 United Kingdom 2021-03-19   4299200 126263     12130
##  6 France         2021-03-19   4242145  91833    285055
##  7 Italy          2021-03-19   3332418 104241   2671638
##  8 Spain          2021-03-19   3212332  72910    150376
##  9 Turkey         2021-03-19   2971633  29864   2788757
## 10 Germany        2021-03-19   2654734  74608   2413584
# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>% 
  filter(Country_Region %in% most_affected_countries) %>% 
  mutate(Country=fct_reorder(factor(Country_Region), Confirmed, .desc = TRUE))

# create a line plot the data
ggplot(most_affected_data, aes(x=Date, y=Confirmed, colour=Country)) +
  geom_line(size=1) + scale_y_continuous(labels=comma) + 
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)

It’s a bit hard to figure out how the pandemic evolved because the numbers in US, Brazil and India are an order of magnitude larger than the rest (which are very close to each other). How can we make it more visible (and also extend the details of the Date scale)?

# create the plot
plot <- ggplot(most_affected_data, aes(x=Date, y=Confirmed, colour=Country)) +
  geom_line(size=0.6) + scale_y_log10(labels=comma) +
  scale_x_date(NULL,
    breaks = breaks_width("2 months"), 
    labels = label_date_short()) + 
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)
## Warning: Transformation introduced infinite values in continuous y-axis

Why did we get a warning message? How can we solve it? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function?
We can see a very similar trend for most countries and while the curve has flattened substantially in April last year, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October last year.

Baseline comparison

To be able to compare the disease progress between countries, we could “normalise” the data to show the epidemic spread using a similar baseline, for example, counting the days since the 100th confirmed case in each country.

baselined_data <- most_affected_data %>% filter(Confirmed>=100000) %>% group_by(Country_Region) %>% 
  mutate(Days_since_100k=difftime(Date,dplyr::first(Date), units = "days")) %>% ungroup()

country_order <- baselined_data  %>% filter(Days_since_100k==300) %>% arrange(desc(Confirmed)) %>% 
  select(Country_Region)
# create the plot
plot <- ggplot(baselined_data, aes(x=Days_since_100k, y=Confirmed, colour=factor(Country, levels=country_order$Country_Region))) +
  geom_line(size=0.6) + scale_y_log10(labels=comma) +
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country", x="Days since the 100,000th patient") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)

Analyse by Population size

We can also look at the change in the number of cases relative to the population size of each country. To do so we will read another dataset of weekly new cases collected by the European Centre for Disease Prevention and Control, which include the population of each country (as of 2019).
We can read it directly from their website as a .csv file (as below) or we can download the most recent daily COVID-19 data from the website (download the Excel file and place it in the data folder, then use read_excel() instead of read_csv()).
We need to wrangle this one up as well to make sure that the countries names are consistent (try to open the file in excel file and find the issue) and the dates are parsed correctly. We will then calculate the number of accumulated cases and deaths using mutate(Total_cases=cumsum(Cases), Total_Deaths=cumsum(Deaths) as well as the number of cases per million people in each country.

eu_data <- read_csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv", na = "") 
glimpse(eu_data)
## Rows: 24,612
## Columns: 10
## $ country          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afgh...
## $ country_code     <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "...
## $ continent        <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "A...
## $ population       <dbl> 38928341, 38928341, 38928341, 38928341, 38928341, ...
## $ indicator        <chr> "cases", "cases", "cases", "cases", "cases", "case...
## $ weekly_count     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 12, 18, 80, 185, 308...
## $ year_week        <chr> "2020-01", "2020-02", "2020-03", "2020-04", "2020-...
## $ rate_14_day      <dbl> NA, 0.000000000, 0.000000000, 0.000000000, 0.00000...
## $ cumulative_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 16, 34, 114, 299, 60...
## $ source           <chr> "Epidemic intelligence, national weekly data", "Ep...
covid_weekly_data <- eu_data %>% 
  rename(Country=country) %>% select(!any_of(c("weekly_count", "rate_14_day"))) %>% 
  mutate(Country=str_to_title(gsub("_", " ", Country)),
         Date = ISOweek2date(sub("-([0-9]{2})", "-W\\1-5", year_week))) %>% # 
  pivot_wider(names_from = "indicator", values_from = "cumulative_count") %>% 
  filter(!grepl("(Total)", Country, fixed = TRUE)) %>% 
  group_by(Country) %>% arrange(Country, Date) %>% # slice(1) %>% 
  mutate(Total_cases=cases, Total_Deaths=deaths,
         Cases_per_mill=Total_cases*1e6/population) %>% 
  ungroup()

# use the line below if using the Excel file downloaded from the web page  
# read_excel("data/COVID-19-geographic-disbtribution-worldwide-2020-03-22.xlsx") 
  
  #        iso3c=countrycode(GeoId, origin = 'iso2c', destination = 'iso3c')) %>% 
  # ungroup()

Similar to what we did earlier, we’ll group the data by Country and sort it by Date to select the most recent data that we’ll use to select the most affected countries.

# find the 10 most affected countries and order them from the most affected to the least one
worst_by_pop_data <- covid_weekly_data %>% group_by(Country) %>% 
  arrange(desc(Date)) %>% slice(1) %>% 
  ungroup() %>% filter(population>100000) %>% arrange(desc(Cases_per_mill)) %>%
  slice(1:10) 

# subset the weekly data 
plot_data <- covid_weekly_data %>% filter(Country %in% worst_by_pop_data$Country) %>% 
  mutate(Country=factor(Country, levels = worst_by_pop_data$Country)) %>% 
  filter(Date>as.Date("2020-04-01"))
# create the plot
ggplot(plot_data, aes(x=Date, y=Cases_per_mill, colour=Country)) +
  geom_line(size=0.6) + scale_y_continuous(labels=comma) +
  scale_x_date(NULL,
    breaks = breaks_width("2 months"), 
    labels = label_date_short()) + 
    scale_color_paletteer_d("rcartocolor::Bold") + # ggsci::springfield_simpsons
  labs(x="Date", y="Cases per million people") +
  theme_bw(16)

# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/cases_per_million_worst_countries.pdf", width=8, height = 6)

Questions

  1. Why didn’t we use the vector of countries that we identified earlier from the original dataset?
  2. What other metrics we could analyse?
  3. What we should take into account that might bias the results or the true status of the pandemic?
1. Because the country names do not necessarily match (check how USA appears in both datasets)  
2. Mortalities (Case Fatality Rate), cases per population density (population/area)?  
3. Suggestions?

We can look at the average cases by continent and how the pandemic spread across the globe and the different responses of the continents.
Check at the following BBC site and check roughly when lockdowns and travel restrictions were applied and plot it.

cont_weekly_data <- covid_weekly_data %>% group_by(continent, Date) %>% 
  summarise_at(c("Total_cases", "Total_Deaths"), ~mean(., na.rm=TRUE)) %>% filter(Total_cases>1) %>% 
  ungroup()
# create the plot
cont_plot <- ggplot(cont_weekly_data, aes(x=Date, y=Total_cases, colour=continent)) +
  geom_line(size=0.6) + 
  geom_segment(aes(x = dmy("15/03/2020"), xend = dmy("15/03/2020"), y = 1, yend = 100000), colour="red", linetype ="dashed", size=0.25) + 
  scale_y_log10(labels=comma) +
    scale_color_paletteer_d("rcartocolor::Bold") +
  scale_x_date(NULL,
    breaks = scales::breaks_width("2 months"), 
    labels = scales::label_date_short() ) +
  labs(color="Continent", y="Mean confirmed cases") +
  theme_bw(16)
# show an interactive plot
ggplotly(cont_plot)

Maps

We can use the same data to visualise the impact of COVID-19 on a global/regional map to grasp its worldwide effect.

Choroplet Map by Country

We will create a color scale to represent the number of cases and appropriate popup information tags for each country.

# download map data
world_map <- download_map_data("custom/world-palestine-highres")
mapdata <- get_data_from_map(world_map)

# select only most current data
choroplet_data <- covid_weekly_data %>%  group_by(Country) %>% arrange(desc(Date)) %>% 
  slice(1) %>% filter(Total_cases>0) %>% 
  mutate(log_cases=log(Total_cases),
         CFR=scales::percent(Total_Deaths/Total_cases, accuracy=.01),
         Total_cases=scales::comma(Total_cases, accuracy=1), 
         Total_Deaths = scales::comma(Total_Deaths, accuracy=1), 
         Cases_per_mill=scales::number(Cases_per_mill, accuracy = .01,big.mark = ","), 
         pop=scales::comma(population, accuracy=1),
         geoID=countrycode(country_code, origin = 'iso3c', destination = 'iso2c', 
                           warn = FALSE),
         geoID=case_when(country_code=="XKX"~"KV",
                         geoID=="UK"~"GB",
                         geoID=="EL"~"GR",
                         geoID=="PS"~"WE",
                         geoID=="JPG11668"~"UM",
                         TRUE~geoID))


# # define colors

bins <- 10^(0:8)# c(1, 10, 100, 1000, 10000, 100000, 1000000)

cols <- as.character(paletteer_c("viridis::inferno", length(bins), direction = -1))
stops <- data.frame(q=1:length(bins)/length(bins), c=cols) %>% list_parse2(.)

# plot map
  highchart(type = "map", hc_opts = list(caption=glue("<b>Number of confirmed COVID-19 cases by Country</b><br/>An up-to-date summary of daily data obtained from the European Centre for Disease Prevention and Control<br/>Data was last updated on {format(max(choroplet_data$Date), '%d/%m/%Y')}."))) %>%
    hc_add_series_map(map = world_map, df = choroplet_data,
                      value = "log_cases", joinBy = c("iso-a2", "geoID")) %>%
    hc_colorAxis(stops  = color_stops(length(bins), cols), tickPositions=log(bins), showLastLabel=FALSE, labels=list(formatter=JS("function(){ 
    var n=Math.exp(this.value);
    if (n < 1e3) return n.toFixed(0);
    if (n >= 1e3 && n < 1e6) return +(n / 1e3).toFixed(1) + 'k';
    if (n >= 1e6 && n < 1e9) return +(n / 1e6).toFixed(1) + 'M';
    if (n >= 1e9 && n < 1e12) return +(n / 1e9).toFixed(1) + 'B';
    if (n >= 1e12) return +(n / 1e12).toFixed(1) + 'T';}"))) %>%
    hc_tooltip(useHTML=TRUE,headerFormat='',
               pointFormat = '{point.Country} confirmed cases : <span style="     border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: gold !important;" >{point.Total_cases}</span><br/>Deaths: <span style="     color: white !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: orangered !important;" >{point.Total_Deaths} (CFR {point.CFR})</span><br/>Population (2019): <b>{point.pop}</b><br/>Cases per million: <b>{point.Cases_per_mill}</b>') %>%
    hc_mapNavigation(enabled = TRUE) %>%
  hc_title(text = "Number of confirmed COVID-19 cases by Country",
           margin = 40, align = "left",
           style = list(color = "#2b908f", useHTML = TRUE)) %>%
  hc_subtitle(text = glue('Data origin: <a href="https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19">European Centre for Disease Prevention and Control</a> (updated on {format(max(choroplet_data$Date), "%d/%m/%Y")})'),
              align = "left",
              style = list(color = "#2b908f", fontWeight = "bold")) %>%
  hc_exporting(enabled = TRUE)

What else can we plot or investigate?

1. Mortalities (Case Fatality Rate), cases per population density (population/area)?  
2. Hospitalisation data  
3. Vaccination data  

Additional Resources

  • Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) data repository and website
  • EU 14-days COVID-19 data for download in CSV/Excel format link
  • Be awesome in ggplot2: A Practical Guide to be Highly Effective - R software and data visualization (link)
  • COVID-19 vaccination data in Our World in Data site
  • My very own COVID-19 dashboard (created in R)

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.